TXT File Extraction
This document describes TXT file contact extraction in the Bulk Messaging System. It covers the regex-based phone number detection algorithm, multi-separator splitting, name extraction when a name shares a line with a phone number, the supported TXT formats, mixed format handling, edge cases, and fallback parsing strategies.
The TXT extraction functionality is implemented in two primary locations:
A Flask-based API service that handles file uploads and contact extraction
Standalone Python utilities for direct command-line usage
Component diagram: the API service (app.py) calls Contact Extraction (extract_contacts.py), the Manual Numbers Parser (parse_manual_numbers.py), and the Phone Validator (validate_number.py). Contact Extraction depends on pandas; the API service depends on pandas, openpyxl, xlrd, flask, and flask-cors.
The TXT extraction system consists of several key components working together:
Phone Number Cleaning Algorithm#
The core phone number cleaning function performs systematic normalization:
Removes common separators (hyphens, spaces, parentheses, periods)
Strips non-digit characters except plus signs
Handles country code detection and normalization
Validates length constraints (7-15 digits)
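The cleaning steps listed above can be sketched as follows. This is a minimal illustration, not the project's actual function: the name `clean_phone_number` is hypothetical, and country code detection is left out because the rules it follows are not specified here.

```python
import re
from typing import Optional


def clean_phone_number(raw: str) -> Optional[str]:
    """Normalize a raw phone string; return None if it fails length validation."""
    # Remove the common separators: hyphens, whitespace, parentheses, periods.
    stripped = re.sub(r"[-\s().]", "", raw)
    # Remember whether the number started with a plus sign.
    has_plus = stripped.startswith("+")
    # Drop every remaining character that is not a digit.
    digits = re.sub(r"\D", "", stripped)
    # Enforce the documented 7-15 digit length constraint.
    if not 7 <= len(digits) <= 15:
        return None
    return "+" + digits if has_plus else digits


print(clean_phone_number("(555) 123-4567"))    # 5551234567
print(clean_phone_number("+44 20 7946 0958"))  # +442079460958
print(clean_phone_number("12345"))             # None
```

Returning None for out-of-range lengths is one possible design; a caller could equally raise an error or return an empty string.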
TXT File Processing Pipeline#
The TXT extraction follows a multi-stage approach:
Line-by-line processing with UTF-8 encoding
Multi-separator splitting using comma, semicolon, tab, and pipe delimiters
Pattern-based phone number detection using regex
Fallback extraction when separators are absent
Name extraction from remaining parts
Fallback Parsing Strategies#
When initial parsing fails, the system employs progressive fallback mechanisms:
Separator-based splitting with multiple delimiter support
Whole-line regex matching for phone numbers
Name extraction from remaining text segments
Graceful error handling and empty line skipping
The TXT extraction architecture implements a layered approach with robust error handling and fallback mechanisms.
TXT Extraction Algorithm#
The TXT extraction algorithm combines separator-based splitting, regex matching on each part, and a whole-line fallback.
Flowchart (applied to each line of the file):
Split the line into parts on the separators [, ; \t |]
Find phone candidates among the parts using the regex pattern
If a phone candidate is found, extract the name from the remaining parts
If none is found, apply the fallback regex to the entire line; if that also finds nothing, move on to the next line
Clean the phone number
Validate the length (7-15 digits); add valid contacts to the contacts list and skip invalid ones
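The per-line flow can be sketched in a few lines of Python. This is an illustrative reading of the flowchart, not the code in extract_contacts.py: the function name `parse_line`, the use of `fullmatch` for the per-part test, and the simplified inline cleaning step are all assumptions.

```python
import re
from typing import Optional, Tuple

SEPARATORS = re.compile(r"[,;\t|]")
PHONE_PART = re.compile(r"[\d+\-\(\)\s]{7,}")     # primary pattern from this document
FALLBACK = re.compile(r"[\+]?[\d\-\(\)\s]{7,}")   # fallback pattern from this document


def parse_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (name, phone) for one line, or None if no valid phone is found."""
    line = line.strip()
    if not line:
        return None
    parts = [p.strip() for p in SEPARATORS.split(line) if p.strip()]
    # Stage 1: the first part made up entirely of phone characters.
    phone_idx = next(
        (i for i, p in enumerate(parts) if PHONE_PART.fullmatch(p)), None
    )
    if phone_idx is not None:
        phone_raw = parts[phone_idx]
        # The name is the first remaining part, if there is one.
        name = next((p for i, p in enumerate(parts) if i != phone_idx), "")
    else:
        # Stage 2: fallback regex over the whole line.
        match = FALLBACK.search(line)
        if match is None:
            return None
        phone_raw = match.group().strip()
        name = (line[:match.start()] + " " + line[match.end():]).strip(" ,;\t|")
    # Clean: keep the digits, plus a leading plus sign if one was present.
    digits = re.sub(r"\D", "", phone_raw)
    if not 7 <= len(digits) <= 15:
        return None
    phone = "+" + digits if phone_raw.startswith("+") else digits
    return name, phone


print(parse_line("John Doe,123-456-7890"))    # ('John Doe', '1234567890')
print(parse_line("+1-555-0123 John Smith"))   # ('John Smith', '+15550123')
print(parse_line("Jane Smith"))               # None
```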
Phone Number Detection Patterns#
The system uses two regex patterns for phone number identification, a primary pattern and a fallback pattern:
Primary Detection Pattern#
The main pattern [\d+\-\(\)\s]{7,} identifies phone numbers by:
Matching digits (\d)
Including plus signs (+)
Allowing hyphens (\-)
Permitting parentheses (\(\))
Supporting spaces and other whitespace (\s)
Requiring a run of at least 7 such characters
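To see what the primary pattern accepts, it can be run against a few sample parts. The sketch below assumes the pattern is applied to a whole part with `fullmatch`; the helper names `looks_like_phone` and `digit_count` are hypothetical.

```python
import re

PRIMARY = re.compile(r"[\d+\-\(\)\s]{7,}")


def looks_like_phone(part: str) -> bool:
    """True if the whole part consists of 7 or more phone characters."""
    return PRIMARY.fullmatch(part) is not None


def digit_count(part: str) -> int:
    """Number of digit characters in the part."""
    return sum(ch.isdigit() for ch in part)


for sample in ["123-456-7890", "+44 20 7946 0958", "John Doe", "555.123.4567", "-------"]:
    print(repr(sample), looks_like_phone(sample), digit_count(sample))
```

Two things stand out. The pattern counts characters, not digits, so a run of seven hyphens passes it; this is why the separate 7-15 digit length validation matters. Also, taken literally, the character class contains no period, so a dot-separated part such as 555.123.4567 does not match this pattern on its own.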
Fallback Detection Pattern#
The fallback pattern [\+]?[\d\-\(\)\s]{7,} handles:
Optional leading plus sign
Flexible digit and separator combinations
Whole-line matching when separators are absent
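A small sketch of the fallback pattern applied to a whole line. It assumes `re.search` is used and that the text left over after removing the match is treated as the name; `fallback_extract` is a hypothetical helper, not a function from the codebase.

```python
import re
from typing import Optional, Tuple

FALLBACK = re.compile(r"[\+]?[\d\-\(\)\s]{7,}")


def fallback_extract(line: str) -> Optional[Tuple[str, str]]:
    """Return (raw_phone, leftover_text) for a line without separators."""
    match = FALLBACK.search(line)
    if match is None:
        return None
    raw_phone = match.group().strip()
    # Whatever surrounds the match is kept as the candidate name.
    leftover = (line[:match.start()] + " " + line[match.end():]).strip()
    return raw_phone, leftover


print(fallback_extract("+1-555-0123 John Smith"))  # ('+1-555-0123', 'John Smith')
print(fallback_extract("5551234567 Jane Doe"))     # ('5551234567', 'Jane Doe')
print(fallback_extract("Jane Smith"))              # None
```

Because \s is part of the character class, the match can include the space that follows the number, which is why the raw phone is stripped before use.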
Multi-Separator Splitting Logic#
The system supports four primary separators with equal precedence: comma, semicolon, tab, and pipe.
After splitting, each part is tested against the phone regex. A part that looks like a phone number is selected as a phone candidate; any other part is selected as a name candidate.
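Splitting and classification can be illustrated together. The sketch assumes a single character class covers all four separators and that each part is tested with `fullmatch`; `classify_parts` is a hypothetical name.

```python
import re
from typing import Dict, List

SEPARATORS = re.compile(r"[,;\t|]")          # comma, semicolon, tab, pipe
PHONE_LIKE = re.compile(r"[\d+\-\(\)\s]{7,}")


def classify_parts(line: str) -> Dict[str, List[str]]:
    """Split a line on any separator and sort parts into phone and name candidates."""
    result: Dict[str, List[str]] = {"phone": [], "name": []}
    for part in SEPARATORS.split(line):
        part = part.strip()
        if not part:
            continue  # empty fields, e.g. from ",,", are ignored
        key = "phone" if PHONE_LIKE.fullmatch(part) else "name"
        result[key].append(part)
    return result


print(classify_parts("Jane Smith;john.doe@gmail.com|+1-555-0123"))
# {'phone': ['+1-555-0123'], 'name': ['Jane Smith', 'john.doe@gmail.com']}
```

Using one character class gives every separator the same precedence, so a line may mix separators freely.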
Name Extraction Process#
When a line combines a name with a phone number, the name is taken from the parts that remain once the phone candidate has been identified.
Mixed Format Handling#
The algorithm prioritizes:
First phone candidate: Selected when multiple phone-like segments exist
First non-empty segment: Used as name when no clear phone candidate exists
Fallback extraction: When separators are absent, the system extracts from the entire line
Name Candidate Selection#
Supported TXT Formats and Examples#
Standard Separated Format#
John Doe,123-456-7890
Jane Smith;john.doe@gmail.com|+1-555-0123
Bob Wilson|+44 20 7946 0958
Alice Brown,555.123.4567
Mixed Format Handling#
+1-555-0123 John Smith
5551234567 Jane Doe
+442079460958 Bob Wilson
Edge Case Examples#
John Doe,123-456-7890,,Extra Field
123-456-7890
Jane Smith
Error Recovery Mechanisms#
The system implements comprehensive error handling:
File Processing Errors#
UTF-8 encoding enforcement
Graceful handling of unreadable files
Empty file and directory handling
Parsing Failures#
Line-by-line processing with individual error isolation
Empty line skipping
Partial parsing continuation on errors
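Per-line error isolation can be sketched as a loop that catches exceptions one line at a time. The function `parse_lines`, its return shape, and the `demo_parser` used to trigger a failure are hypothetical and only illustrate the idea.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Contact = Tuple[str, str]


def parse_lines(
    lines: Iterable[str],
    parse_line: Callable[[str], Optional[Contact]],
) -> Tuple[List[Contact], List[Tuple[int, str]]]:
    """Parse every line, collecting contacts and (line_number, message) errors."""
    contacts: List[Contact] = []
    errors: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip empty lines
        try:
            contact = parse_line(line)
        except Exception as exc:  # isolate the failure to this one line
            errors.append((number, str(exc)))
            continue
        if contact is not None:
            contacts.append(contact)
    return contacts, errors


def demo_parser(line: str) -> Contact:
    # Raises ValueError unless the line has exactly two comma-separated fields.
    name, phone = line.split(",")
    return name.strip(), phone.strip()


contacts, errors = parse_lines(["A,1234567", "", "broken", "B,7654321"], demo_parser)
print(contacts)      # [('A', '1234567'), ('B', '7654321')]
print(len(errors))   # 1
```

One bad line is recorded with its line number and the remaining lines are still processed.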
Phone Number Validation#
Length validation (7-15 digits)
Format normalization
Country code detection and correction
The TXT extraction system relies on several key dependencies:
External Dependencies Impact#
pandas: Enables structured data processing for CSV/Excel files
openpyxl/xlrd: Provides Excel file format support
flask/flask-cors: Powers the web API interface
werkzeug: Handles file uploads and security
The TXT extraction system is optimized for efficiency and scalability:
Algorithm Complexity#
Time Complexity: O(n × m) where n is number of lines and m is average parts per line
Space Complexity: O(k) where k is number of valid contacts extracted
Memory Usage: Linear with respect to file size
Optimization Strategies#
Single-pass line processing
Early skipping of empty lines
Minimal regex operations per line
Efficient string operations for phone number cleaning
Scalability Factors#
File size limitations (16MB max upload)
Memory constraints for large files
Regex compilation caching
Streaming file processing
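Streaming and regex reuse can be combined in a generator. This sketch assumes the file is read lazily from a text handle and that the pattern is compiled once at module level; `iter_phone_candidates` is a hypothetical name.

```python
import io
import re
from typing import IO, Iterator

# Compiled once and reused for every line.
PHONE = re.compile(r"[\+]?[\d\-\(\)\s]{7,}")


def iter_phone_candidates(handle: IO[str]) -> Iterator[str]:
    """Lazily yield one raw phone candidate per matching line."""
    for line in handle:  # text handles yield one line at a time
        line = line.strip()
        if not line:
            continue
        match = PHONE.search(line)
        if match:
            yield match.group().strip()


sample = io.StringIO("John Doe,123-456-7890\n\nJane Smith\n+1-555-0123 John Smith\n")
print(list(iter_phone_candidates(sample)))  # ['123-456-7890', '+1-555-0123']
```

With a real file, the handle would come from `open(path, encoding="utf-8")`. Only the current line is held in memory while iterating, although collecting all results into a list still grows with the number of contacts.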
Common Issues and Solutions#
Phone Number Not Detected#
Symptoms: Phone numbers appear as empty or invalid
Causes:
Numbers shorter than 7 digits or longer than 15 digits
Unrecognized separators or formatting
Leading zeros in international numbers
Solutions:
Ensure numbers meet length requirements
Use recognized separators (spaces, hyphens, parentheses)
Include country codes for international numbers
Mixed Format Problems#
Symptoms: Names incorrectly extracted or phone numbers missed
Causes:
Ambiguous separators causing misinterpretation
Names containing phone number patterns
Missing separators in lines
Solutions:
Use consistent separator usage
Place phone numbers first when mixing formats
Include explicit separators between name and number
File Encoding Issues#
Symptoms: Characters appear corrupted or parsing fails
Causes:
Non-UTF-8 file encoding
Special characters not properly handled
BOM (Byte Order Mark) interference
Solutions:
Save files in UTF-8 encoding
Remove BOM if present
Verify character encoding compatibility
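One way to make reading tolerant of a BOM is Python's `utf-8-sig` codec, which decodes UTF-8 and drops a leading byte order mark when present. The helper below is a sketch; `read_text_lines` is a hypothetical name, and replacing undecodable bytes instead of rejecting the file is an assumption about the desired behavior.

```python
from pathlib import Path
from typing import List


def read_text_lines(path: str) -> List[str]:
    """Read lines as UTF-8, dropping a leading BOM if there is one."""
    data = Path(path).read_bytes()
    try:
        # "utf-8-sig" strips a leading BOM and otherwise behaves like UTF-8.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumed recovery choice: substitute U+FFFD for undecodable bytes.
        text = data.decode("utf-8-sig", errors="replace")
    return text.splitlines()
```

A file saved in another encoding still decodes under this approach, but names with non-ASCII characters will contain replacement characters, so re-saving the file as UTF-8 remains the more reliable fix.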
Performance Issues#
Symptoms: Slow processing for large files
Causes:
Very large file sizes exceeding limits
Complex regex patterns
Memory constraints
Solutions:
Split large files into smaller chunks
Optimize regex patterns
Monitor memory usage during processing
Error Codes and Messages#
The system provides structured error reporting:
File not found errors
Unsupported file type errors
Processing exceptions with detailed messages
Validation failures with specific reasons
The TXT file extraction system processes contacts using regex-based phone number detection and name extraction from the remaining text. Its layered approach (separator splitting first, then whole-line matching) parses a range of formats while isolating errors to individual lines. It handles edge cases and mixed formats, and falls back to progressively looser parsing when the initial split finds no phone number.
The implementation demonstrates best practices in:
Progressive parsing with multiple fallback strategies
Comprehensive error handling and recovery
Efficient regex pattern matching
Flexible separator support
International phone number normalization
This foundation enables reliable bulk messaging operations while maintaining data integrity and user experience across different contact data formats.